Neural mechanisms of object recognition

Maximilian Riesenhuber* and Tomaso Poggio†


Single-unit  recordings  from  behaving  monkeys  and  human 
functional  magnetic  resonance  imaging  studies  have  continued 
to  provide  a  host  of  experimental  data  on  the  properties  and 
mechanisms  of  object  recognition  in  cortex.  Recent  advances 
in  object  recognition,  spanning  issues  regarding  invariance, 
selectivity,  representation  and  levels  of  recognition  have  allowed 
us to propose a putative model of object recognition in cortex.

Abbreviations

FFA  fusiform  face  area 

fMRI  functional  magnetic  resonance  imaging 

IT  inferotemporal  cortex 

Max  maximum 

PFC  prefrontal  cortex 

RBF  radial  basis  function 

V1 primary visual cortex

V2  secondary  visual  cortex 

Introduction 

Object  recognition  is  fundamental  to  the  behavior  of  higher 
primates.  It  is  also  the  most  remarkable  achievement  of  the 
visual  cortex  and  one  that  probably  greatly  influences  its 
functional  architecture.  The  visual  system  rapidly  and 
effortlessly  recognizes  a  large  number  of  diverse  objects  in 
cluttered,  natural  scenes  —  a  very  difficult  computational 
task.  Here,  we  review  progress  in  this  field  over  the  past 
two  years.  We  do  so  in  the  context  of  a  recent  quantitative 
model,  which  helps  us  summarize  and  organize  existing 
data  as  well  as  interpret  contradictory,  and  occasionally  ill- 
defined,  claims.  We  organize  the  discussion  of  the  new  data 
around  the  four  key  issues  of  object  recognition:  invariance, 
selectivity,  object  representation  and  levels  of  recognition. 

Invariance 

Simple cells in primary visual cortex (V1) have small
receptive  fields  and  respond  preferentially  to  oriented 
bars.  Progressing  along  the  ventral  stream  —  thought  to 
play  a  central  role  in  object  recognition  in  cortex  [1,2]  — 
neurons  show  an  increase  in  receptive  field  size  and  in  the 
complexity  of  their  preferred  stimuli  [3].  At  the  top  of  the 
ventral  stream,  in  the  inferotemporal  cortex  (IT),  cells  are 
tuned  to  complex  stimuli  such  as  faces  [4-7].  A  hallmark  of 
these  IT  cells  is,  in  addition  to  selectivity,  the  robustness 
of their firing to stimulus transformations, such as scale
and position changes [1,2,8,9]. In contrast, later studies
[8,10-12]  have  shown  that  most  neurons  show  specificity 
for  a  certain  object  view  or  lighting  condition.  In  particular, 
Logothetis  et  al.  [8]  trained  monkeys  to  perform  an  object 
recognition  task  with  isolated  views  of  novel  objects 
(paperclips).  When  recording  from  the  animals’  IT,  they 
found  that  the  great  majority  of  neurons  selectively  tuned 
to  the  training  objects  showed  tight  tuning  to  a  specific 
view  of  one  of  the  training  objects  (a  few  units  showed 
greater  tolerance,  in  agreement  with  earlier  predictions 
[13]).  The  view-tuned  neurons  also  showed  an  average 
scale  invariance  of  two  octaves.  That  is,  the  neurons  still 
responded  at  a  higher  level  to  the  scaled  image  of  their 
preferred  paperclip  than  to  other  paperclips,  even  when 
stimulus  size  was  varied  over  two  octaves.  Furthermore, 
the view-tuned neurons had an average translation invariance
of 4° (for typical stimulus sizes of 2°) [14], which is
much  smaller  than  previous  reports,  but  large  for  any 
computational  mechanism.  A  very  recent  study  (JJ  DiCarlo, 
JHR  Maunsell,  personal  communication),  using  different 
stimuli and training paradigms, reports translation invariance
from one view of less than 3°, pointing to a possible
influence  of  training  history  and  object  shape  on  invariance 
ranges.  Human  functional  magnetic  resonance  imaging 
(fMRI)  data  have  shown  a  similar  pattern  of  invariance 
properties  for  the  lateral  occipital  cortex,  a  brain  region  in 
human  visual  cortex  central  to  object  recognition  and 
believed  to  be  the  homolog  of  monkey  area  IT  [15-17]. 

From  a  computational  point  of  view  one  might  ask  the 
question:  which  object  transformations  can  be  estimated 
from  one  versus  several  object  views?  It  is  well  known  that 
only  a  very  small  number  of  views  are  required  to  generalize 
object  recognition  across  different  uniform  transformations 
[18*  and  references  therein].  Scaling  and  translation  in  the 
image  plane,  for  instance,  solely  require  a  single  object 
view,  as  they  preserve  the  original  information  of  an  image. 
In  this  case,  it  is  possible  to  dispense  with  the  need  for 
additional  examples  of  different  sizes  or  positions  in  the 
field  of  view.  In  sharp  contrast,  multiple  views  are  generally 
required to recognize objects subjected to three-dimensional
shape transformations, whether actual (such as the
rotation of objects in depth) or induced (such as those
resulting from illumination changes). The frontal view of
a  novel  face,  for  instance,  does  not  contain  sufficient 
information  to  predict  the  profile  of  that  face. 

Figure 1

Model of the architecture of recognition in the cortex [18*]. The model
combines and extends several recent models [9,13,14,55,56] and
effectively summarizes many experimental findings. The first part is a
view-based module [14], a hierarchical extension of the classical paradigm
of building complex cells from simple cells [57]. The hierarchy of layers
has two different types of pooling mechanisms. The first layer in V1
represents linear oriented filters similar to simple cells; each unit in the
next layer pools the outputs of simple cells of the same orientation but at
slightly different positions (scales). Each of these units is still
orientation-selective but more invariant to position (scale), similarly to
some complex cells. In the next stage, signals from complex cells with
different orientations but similar positions are combined to create neurons
(S2) tuned to a small dictionary of more complex features. The next layer is
equivalent to complex cells in V1: by pooling together signals from S2
cells of the same type but at slightly different positions, the C2 units
become more invariant to position (and scale) but preserve feature
selectivity. They may correspond roughly to V4 cells. In the model, the
C2 cells feed into view-tuned cells (Vn), with connection weights that are
learned from exposure to a view of an object. There may be more levels
in this hierarchy, after the C2 layer. The key idea in the view-tuned
module is to alternate two types of pooling: the first to provide increasing
pattern selectivity (blue lines in the inset) and the second (founded on
the Max operation; dashed green lines in the inset) to provide invariance.
Invariance to translation is achieved by pooling over afferents tuned to
different positions, and invariance to scale (not shown) is accomplished
by pooling over afferents tuned to different scales. The output of the
view-based module is represented by view-tuned model units that
exhibit tight tuning to rotation in depth (and other object-dependent
transformations, such as illumination and facial expression) but are
tolerant to scaling and translation of their preferred object view. Notice
that the cells labeled here as view-tuned units encompass, between the
anterior IT (AIT) and posterior IT (PIT), a spectrum of tuning from views to
complex features: depending on the synaptic weights determined during
learning, each view-tuned cell becomes effectively connected to all or
only a few of the units activated by the object view [20]. The second part
of the model starts with the view-tuned cells. Invariance to rotation in
depth is obtained by combining, in a learning module, several view-tuned
units tuned to different views of the same object [13], creating
view-invariant units (On). These, as well as the view-tuned units, can then
serve as inputs to task modules that learn to perform different visual
tasks such as identification/discrimination or object categorization. They
consist of the same generic learning circuitry (similar to an RBF network
[13]) but are trained with appropriate sets of examples to perform
specific tasks. In addition to the feed-forward processing, there are likely
feedback pathways for top-down modulation of neuronal responses
throughout the processing hierarchy and to support the learning phase.
All the units in the model represent single cells modeled as simplified
neurons with modifiable synapses.

Computational considerations such as these lead to a
hierarchical architecture of a system for object recognition
that instantiates the basic facts about the ventral pathways
of the brain [18*]. The model shown schematically in
Figure 1 reflects the general organization of visual cortex
in a series of layers from V1 → IT → prefrontal cortex
(PFC). Invariance properties emerge from the functional
organization of two stages of processing. The first, extending
from V1 to IT, is composed of units showing the same
scale and position invariance properties as the view-tuned
IT neurons described by Logothetis et al. [8] using the
same stimuli. Computationally, this is accomplished by a
scheme best explained by taking striate complex cells as
an  example:  invariance  to  changes  in  the  position  of  an 
optimal  stimulus  (within  a  range)  is  obtained  by  means  of 
a  maximum  (Max)  operation  performed  on  the  simple  cell 
inputs to the complex cells. Both simple and complex cells
are assumed to have the same optimal orientation but at
different  positions.  The  key  idea  is  that  the  two  steps  — 
filtering  followed  by  a  Max  operation  —  are  equivalent  to 
a  simple  but  powerful  signal  processing  technique:  select 
the  peak  of  the  correlation  between  the  signal  and  a  given 
matched  filter  (here  the  correlation  is  over  either  position 
or  scale).  The  model  alternates  layers  of  units  combining 
simple  filters  into  more  complex  ones  with  layers  using  the 
Max  operation,  in  order  to  build  invariance  to  position  and 
scale  while  increasing  pattern  selectivity.  In  the  second 
part  of  the  architecture,  learning  from  multiple  examples, 
represented  by  view-tuned  units,  leads  to  view-invariant 
units,  as  well  as  neural  circuits  performing  specific  tasks. 
The  key  idea  here  is  that  interpolation  and  generalization 
can  be  obtained  by  simple  networks  that  learn  to  combine 
the  output  of  cells,  each  broadly  tuned  to  the  features  of  an 
example  image  [8,13].  Simple  learning  networks  of  this 
type can learn to identify an object across different viewpoints
[13] and illuminations, as well as to categorize objects
across exemplars of a class [19].
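
The filtering-plus-Max step described above can be illustrated with a minimal Python sketch. It is not the implementation used in the model of [18*]; the one-dimensional signals, the toy templates and all function names are assumptions chosen for illustration. A "simple cell" stage correlates a template with the input at every position, and a "complex cell" stage takes the maximum over those positions.

```python
def simple_cell_responses(signal, template):
    """Correlate the template with the signal at every valid position."""
    n, m = len(signal), len(template)
    return [
        sum(signal[i + j] * template[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def complex_cell_response(signal, template):
    """Max over the position-shifted simple-cell responses."""
    return max(simple_cell_responses(signal, template))


def place(pattern, offset, length=12):
    """Embed a pattern in a zero background at the given offset."""
    signal = [0.0] * length
    for j, value in enumerate(pattern):
        signal[offset + j] = value
    return signal


# Hypothetical preferred feature and a different, non-preferred feature.
template = [1.0, 2.0, 1.0]
other = [2.0, 0.0, 2.0]

# The pooled response is the same wherever the preferred feature appears...
resp_left = complex_cell_response(place(template, 2), template)
resp_right = complex_cell_response(place(template, 7), template)
# ...and stays lower for the non-preferred feature at any position.
resp_other = complex_cell_response(place(other, 2), template)
```

In this toy setting the pooled response does not change when the preferred pattern is shifted, whereas any single simple-cell response does, which is the sense in which the Max operation trades position information for invariance while keeping selectivity.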

The  model  described  above  predicts  several  experimental 
results  and  provides  interesting  perspectives  on  still  other 
data  and  claims.  For  instance,  the  model  accounts  (see 
[14, 18*, 19, 20])  for  the  response  of  tuned  IT  cells  to  multiple 
objects  in  the  receptive  field  [21],  scrambled  objects  [22], 
cluttered  [23]  and  mirror  views  [2].  It  also  shows  a  degree 
of  performance  roughly  in  agreement  with  physiological 
and  psychophysical  data  obtained  from  specific  tasks. 
These  include  the  cat  versus  dog  categorization  task 
described  by  Freedman  et  al.  [24*],  object  identification, 
gender  classification  and  possibly  the  face  habituation 
effect of Leopold et al. [25], as well as the effects of contrast,
mirror  and  figure-ground  reversal  described  by  Baylis  and 
Driver  [26].  Preliminary  data  [27]  support  a  specific 
prediction  of  the  model  —  the  existence  of  a  Max-like 
pooling  operation  to  increase  invariance  (see  Figure  1). 

A  key  function  of  models  is  to  clarify  basic  issues  and  the 
interpretation  of  relevant  data.  In  the  following,  we  will 
use  the  model  shown  in  Figure  1  to  discuss  three  focal 
topics  of  recent  research  in  object  recognition:  the  feature 
tuning  of  neurons  in  higher  visual  areas,  the  nature  and 
organization  of  object  representation,  and  the  relationship 
between  identification  and  categorization  tasks. 

Selectivity 

Invariance  is  one  requirement  for  object  recognition,  the 
other one being selectivity. Several studies have established
that IT neurons can become tuned to task-relevant
objects  and  their  views  [8, 28, 29*, 30*]  or  to  objects  in  the 
monkey’s  environment  [10],  suggesting  that  the  activity  of 
these  neurons  may  be  part  of  the  representation  of  objects 
occurring  in  an  animal’s  environment.  The  preferred  stimuli 
of  neurons  in  intermediate  stages  of  the  ventral  stream  are 
less  clear,  possibly  because  of  the  difficulty  of  knowing 
which  stimuli  to  use  to  probe  their  neural  selectivity. 
Reports of preferred features of neurons in V4, the visual
area preceding IT in the ventral pathway, vary depending
on the set of stimuli used to probe responses, including
Cartesian gratings [31], polar and hyperbolic sinusoidal
gratings [32], and contour features [33]. In the secondary
visual cortex (V2), a recent study [34] has reported neuronal
preferences for complex stimuli such as arcs, intersecting
lines and non-Cartesian gratings.

Instead  of  probing  neuronal  tuning  with  a  fixed  set  of 
stimuli,  another  set  of  studies  [1,3,35-37]  has  employed  a 
‘simplification  procedure’  in  an  effort  to  define  the  features 
crucial  to  activate  a  neuron.  In  this  approach,  a  complex 
natural  stimulus  (such  as  a  face)  to  which  the  neuron 
under  study  responds,  is  progressively  ‘simplified’  (e.g.  by 
removing  color  or  texture,  or  simplifying  complex  shapes 
into  simpler  geometric  primitives)  such  that  the  magnitude 
of  the  response  remains  the  same  as  that  elicited  by  the 
original,  unsimplified  object.  The  stimulus  that  cannot  be 
‘simplified’  further  without  decreasing  the  firing  rate  is 
then  defined  as  the  effective  stimulus  for  that  cell.  A  study 
using  this  paradigm  [3]  has  reported  an  increase  in  feature 
complexity  from  area  V2  to  anterior  IT.  However,  a  recent 
IT optical imaging study [37], supported by single-cell
recordings, demonstrates the fundamental difficulty of
determining  a  neuron’s  preferred  feature  in  higher  visual 
areas.  These  authors  report  that,  in  fact,  in  the  majority  of 
cases,  ‘simplifying’  a  stimulus  led  to  the  activation  of 
additional  IT  neurons  relative  to  the  original  ‘complex’ 
stimulus.  Interestingly,  the  model  described  in  Figure  1 
does  actually  qualitatively  predict  what  is  observed  — 
neurons  tuned  to  a  dictionary  of  features  at  different  levels 
of  complexity.  Moreover,  preliminary  simulations  suggest 
that,  for  IT  model  units,  the  effect  of  the  ‘simplification’ 
procedure  may  well  lead  to  the  observations  reported  by 
Tsunoda et al. [37].

Representation 

Related  to  the  issue  of  neuronal  tuning  is  the  question  of 
the  precise  nature  of  object  representation  in  cortex.  It  has 
recently  been  put  forward,  on  the  basis  of  a  set  of  human 
fMRI studies, that some object classes (faces [38], places
[39] and body parts [40*]) are processed by distinct
modules in cortex. Another fMRI study [41*] has shown that
objects of a certain class (e.g. faces) evoke a distributed
pattern of activity that is not confined to the aforementioned
specialized modules (e.g. the fusiform face area
[FFA]  [38]),  and  that  activation  patterns  outside  a  specific 
module  are  sufficient  for  object  categorization.  Some  data, 
therefore,  appear  to  argue  for  a  ‘modular’  framework  of 
object  representation  in  cortex,  where  specific  brain  areas 
are  posited  to  perform  computations  unique  to  the  object 
class  at  hand.  Other  data,  however,  support  a  model  in 
which  objects  from  different  classes  are  represented  in  a 
distributed  fashion,  and  their  recognition  is  founded  on  the 
same  computations. 

The  model  represented  in  Figure  1  supports  the  latter 
claim. Figure 2 helps to reconcile the two sets of data.

Figure 2


Tuning of a model face unit. The unit is a view-tuned
unit like the Vn shown in Figure 1, tuned
to the leftmost face on the bottom axis. The
blue line shows how the unit’s response changes
as  the  stimulus  is  gradually  morphed  away 
from  the  preferred  stimulus  to  another  face 
(along  the  axis).  The  unit’s  response  changes 
gradually  with  changes  in  the  stimulus, 
permitting  subordinate  level  discrimination 
(especially  when  using  a  population  code 
consisting  of  several  units  tuned  to  different 
representatives  of  the  class  [46]).  The  same 
unit  also  responds  to  the  animal  stimuli  shown 
on  the  right  (green  crosses),  but  at  a  lower 
level  than  to  the  faces.  This  permits  a  coarse 
categorization  of  a  stimulus  as  an  animal 
stimulus,  on  the  basis  of  the  face  unit’s  firing 
[41*]. Units such as these can form the basis
of a categorization circuit [19]. Face images
courtesy  of  T  Vetter  [58]. 


Model  IT  units  have  preferred  afferent  activation  patterns 
that  can  represent  a  full  or  partial  view  of  an  object  [20].  At 
the  highest  levels  of  the  model,  the  tuning  ranges  from  the 
one  described  by  Tanaka  [1]  to  the  view-tuning  described 
for  paperclips  and  faces.  Contrary  to  some  claims  [17],  both 
the  model  and  the  experimental  data  [42**, 43]  suggest 
that  faces  are  not  special.  In  fact,  the  model  predicts  the 
existence  of  neurons  tuned  to  moderately  complex  features 
and  cells  tuned  specifically  to  views  of  objects,  depending 
on  task  training  and  difficulty  [44].  In  doing  so,  it  suggests 
that  the  distinction  between  ‘complex  features’  and 
‘objects’  is  largely  semantic:  during  training,  a  cell  may 
well  become  tuned  to  a  feature  that  is  diagnostic  for  the 
object  rather  than  to  a  full  view,  depending  on  the 
specificity  and  number  of  its  afferents.  What  is  relevant  for 
object  recognition  is  that  the  objects  to  be  discriminated 
produce  distinct  activation  patterns. 

From  a  computational  point  of  view,  groups  of  neurons 
responding  to  representatives  from  different  object  classes 
do not have to be segregated, but are likely to be interdigitated.
Moreover, the same neuron can respond to
objects  from  different  classes,  depending  on  their  visual 
similarity  (Figure  2).  Because  the  activity  of  one  fMRI 
voxel  is  typically  the  average  of  hundreds  of  thousands  of 
neurons,  a  strong  activation  of  the  FFA  for  faces  would 
argue  for  a  higher  density  of  face  neurons  in  that  part  of 
cortex,  perhaps  owing  to  the  great  cognitive  importance 
of  faces.  (For  cautionary  notes  about  the  interpretation  of 
fMRI  images  see  [45**].)  However,  subjects  with  substantial 
expertise  for  other  object  classes  could  be  expected  to 
have a greater number of neurons tuned to objects from
their field of expertise [46], and correspondingly might
show  significant  activation  of  the  same  cortical  regions  for 
these  objects.  Indeed,  in  bird  and  car  experts,  brain  areas 
overlapping  with  the  FFA  have  been  found  [47]  to  be 
specifically  activated  by  birds  and  cars,  and  subjects 
trained  to  recognize  objects  from  a  novel  class  of  objects 
(‘greebles’)  showed  activation  of  the  FFA  by  the  training 
objects  [48*]. 

What  is  the  mechanism  that  permits  the  usage  of  similar 
neural  circuits  for  representing  objects  as  diverse  as  faces, 
birds  and  cars?  The  architecture  and  operational  principles 
of  our  model  offer  one  putative  mechanism  (for  detailed 
computational  simulations,  see  [19,46]).  A  particular 
object,  say  a  specific  face,  will  elicit  different  activities  in 
the  view-specific  Vn  and  object-specific  On  cells  of  Figure  1 
(an  example  of  which  is  shown  in  Figure  2).  Thus,  the 
memory  of  the  particular  face  is  represented  in  an  implicit 
way,  by  a  sparse  population  code  through  the  activation 
pattern  over  the  coarsely  tuned  Vn  and  On  cells. 
Discrimination,  or  memorization  of  specific  objects,  can 
then  proceed  by  comparing  activation  patterns  over  the 
strongly  activated  object-tuned  or  view-tuned  units  [46] 
tuned  to  a  small  number  of  ‘prototypical’  faces  [49].  For  a 
certain  level  of  specificity,  only  the  activations  of  a  small 
number  of  units  have  to  be  stored,  forming  a  sparse  code. 
This  is  in  contrast  to  activation  patterns  at  lower  levels, 
where  units  are  less  specific  and  hence  activation  patterns 
tend  to  involve  more  neurons.  In  a  similar  fashion,  neural 
circuitry  for  categorization,  located  putatively  in  the  PFC 
[24*],  can  be  trained  [19]  to  receive  input  from  relevant 
object-tuned units. For instance, a unit that categorizes
cats versus dogs would receive input from units responding
to some individual cats and dogs. This is in line with a very
recent finding that PFC units show more category tuning
than  IT  neurons,  in  a  macaque  trained  to  categorize  cats 
and  dogs  [50]. 
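
As a rough illustration of such a categorization unit, the following Python sketch reads out a category from the activation pattern over a few exemplar-tuned units. It is only a sketch under stated assumptions: the two-dimensional feature vectors, the Gaussian tuning width and the hand-set readout weights are all hypothetical, whereas in the model the weights would be learned from labeled examples [19].

```python
import math


def rbf_unit(stimulus, center, sigma=1.0):
    """Gaussian tuning: response falls off with distance from the preferred stimulus."""
    dist_sq = sum((s - c) ** 2 for s, c in zip(stimulus, center))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))


# Hypothetical 2-D feature vectors of the exemplars each unit is tuned to.
cat_exemplars = [(0.0, 0.0), (1.0, 0.5)]
dog_exemplars = [(4.0, 4.0), (5.0, 3.5)]


def population_response(stimulus):
    """Activation pattern over all exemplar-tuned units (cat units first)."""
    return [rbf_unit(stimulus, c) for c in cat_exemplars + dog_exemplars]


def categorize(stimulus):
    """Weighted sum of unit activities: +1 from cat units, -1 from dog units.

    The weights are set by hand here (an assumption); a trained circuit
    would learn them from labeled examples.
    """
    weights = [1.0] * len(cat_exemplars) + [-1.0] * len(dog_exemplars)
    score = sum(w * r for w, r in zip(weights, population_response(stimulus)))
    return "cat" if score > 0 else "dog"


# Novel stimuli that match none of the stored exemplars exactly.
label_a = categorize((0.5, 0.2))
label_b = categorize((4.5, 4.0))
```

Because each unit is broadly tuned, stimuli that were never stored still evoke a graded activation pattern, and the same pattern could feed a different readout (for example, one that identifies a single individual) simply by changing the weights.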

In  conclusion,  the  model  depicted  in  Figure  1  suggests 
that  the  same  basic  circuitry,  replicated  many  times  in  IT, 
can  learn  from  visual  experience  to  represent  and  recognize 
different  types  of  objects.  Computations  do  not  need  to  be 
fully  hardwired  by  genes  and  specialized  for  specific  classes 
of  objects. 

Levels  of  recognition 

An  object  can  be  recognized  at  different  levels  —  a  face 
can  be  recognized  as  a  face,  but  also  more  specifically  as 
‘a  male  face’,  ‘Tommy  Poggio’s  face’  or  ‘Tommy  Poggio’s 
smiling  face’.  It  has  been  common  in  cognitive  science  to 
assume  that  recognition  of  an  object  at  different  levels 
relies  on  different  computational  mechanisms  [51,52].  In 
particular,  it  has  been  proposed  that  ‘subordinate  level’ 
recognition (identification) is derived from ‘configurational’
judgements, whereas ‘basic level’ categorization
(a  face?  a  dog?  a  car?)  relies  on  a  qualitative  representation 
formulated  on  the  presence  or  absence  of  features. 

However,  as  Figure  1  makes  clear  and  as  we  have  pointed 
out  earlier  [18*],  all  supervised  recognition  tasks  —  in 
which  the  subject  is  trained  with  labeled  examples  —  are 
identical  from  a  computational  point  of  view:  they  all 
involve  a  classification  established  on  positive  and 
negative  exemplars.  Indeed,  it  is  not  clear  why  different 
computations  should  be  required  to  recognize  a  face  at  the 
subordinate  level  or,  for  example,  to  determine  its  gender. 
In  fact,  it  is  worth  noticing  that  the  basic  radial  basis 
function  (RBF)  network  [8,13]  replicated  at  different  levels 
in  Figure  1  (e.g.  from  view-tuned  to  view-invariant  units), 
can  learn  to  perform  different  tasks  from  the  same  set  of 
training  images.  For  instance,  units  tuned  to  distinct 
expressions  of  a  face  can  feed  into  an  identification  unit 
that  responds  to  a  specific  face;  the  same  units  can  also  be 
used  with  different  synaptic  weights  by  an  expression  unit 
that  has  learned  to  respond,  say,  to  smiling.  In  line  with  the 
model  in  Figure  1,  recent  findings  indicate  that  the  FFA  is 
involved  not  just  in  subordinate  level  face  recognition  but 
also  in  face  detection  [53],  arguing  against  a  specialization 
of  brain  areas  for  recognition  tasks,  such  as  subordinate 
level  recognition  independent  of  object  class.  The  problem 
with  many  experiments  investigating  the  relationship 
between  categorization  and  identification  that  claim  an 
advantage  of  basic  level  recognition  over  subordinate  level 
recognition, is that the tasks used for the different recognition
levels are of differing difficulties. Discriminating a
face  from  a  chair  (categorization)  is  a  much  easier  task  than 
discriminating  between  two  faces  (identification),  as  the 
latter  are  more  similar  to  each  other.  Assuming  that 
physically  similar  stimuli  produce  similar  neuronal  activation 
patterns, and that the ability to discriminate between two
stimuli requires a certain level of evidence (in the form of
firing  rate  differences),  a  finer  discrimination  would 
require  the  accumulation  of  evidence  over  a  longer  time 
period.  A  prediction  would  be  that  if  categorization  and 
identification  tasks  were  equalized  in  terms  of  difficulty, 
they  would  take  a  similar  amount  of  time.  From  the  point 
of  view  of  the  model  represented  in  Figure  1,  the  two  tasks 
are  computationally  equivalent  and  can  be  learned  with 
equal  ease. 
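
The evidence-accumulation argument above can be made concrete with a small Python sketch. Everything in it is an illustrative assumption rather than part of the model: stimuli live on a one-dimensional axis, units have Gaussian tuning curves, the evidence gained per time step is the summed absolute difference between the two activation patterns, and a decision is made once a fixed threshold is reached.

```python
import math


def tuned_response(stimulus, preferred, sigma=1.0):
    """Gaussian tuning curve of one model unit along a 1-D stimulus axis."""
    return math.exp(-((stimulus - preferred) ** 2) / (2.0 * sigma ** 2))


def pattern_difference(stim_a, stim_b, preferred_values):
    """Summed absolute difference between the two activation patterns."""
    return sum(
        abs(tuned_response(stim_a, p) - tuned_response(stim_b, p))
        for p in preferred_values
    )


def steps_to_threshold(stim_a, stim_b, preferred_values, threshold=5.0):
    """Time steps needed if each step adds the pattern difference as evidence.

    Assumes the two stimuli differ, so the evidence per step is non-zero.
    """
    evidence_per_step = pattern_difference(stim_a, stim_b, preferred_values)
    return math.ceil(threshold / evidence_per_step)


units = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical preferred stimuli

# Dissimilar pair (e.g. face versus chair) and similar pair (two faces).
coarse_steps = steps_to_threshold(0.0, 4.0, units)
fine_steps = steps_to_threshold(0.0, 0.5, units)
```

Under these assumptions the similar pair needs more time steps than the dissimilar pair, even though both decisions use the same units and the same readout rule; the difference in time reflects stimulus similarity, not a difference in mechanism.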

Conclusions  and  future  directions 

Most  of  the  old  and  new  data  on  object  recognition  in 
cortex  can  be  summarized  and  interpreted  in  a  quantitative 
and consistent way, by a simple hierarchical, mostly feed-forward
architecture as shown in Figure 1. Of course, many
aspects  of  how  object  recognition  is  performed  are  left 
open  by  simple  models  of  this  kind.  Furthermore,  future 
experiments  may  require  modifications  and  extensions  of 
this  model  and  still  others  may  falsify  significant  parts  of  it. 
For  instance,  data  on  the  neural  correlates  of  border  ownership 
in  V2  [54*]  are  hard  to  incorporate  in  feed-forward  models, 
especially  if  they  hold  true  for  natural  scenes.  In  any  case, 
the  road  ahead  will  require  close  interactions  between 
experimental and computational work.